200 research outputs found

    Sequence information gain based motif analysis

    Background: The detection of regulatory regions in candidate sequences is essential for understanding the regulation of a particular gene and the mechanisms involved. This paper proposes a novel methodology based on information-theoretic metrics for finding regulatory sequences in promoter regions. Results: This methodology (SIGMA) has been tested on genomic sequence data for Homo sapiens and Mus musculus. SIGMA has been compared with different publicly available alternatives for motif detection, such as MEME/MAST, Biostrings (Bioconductor package), MotifRegressor, and previous work such as Qresiduals projections or information-theoretic detectors. Comparative results, in the form of Receiver Operating Characteristic curves, show that, in 70% of the studied Transcription Factor Binding Sites, the SIGMA detector performs better and behaves more robustly than the methods compared, while having a similar computational time. The performance of SIGMA can be explained by its parametric simplicity in the modelling of the non-linear co-variability in the binding motif positions. Conclusions: Sequence Information Gain based Motif Analysis is a generalisation of a non-linear model for the detection of cis-regulatory sequences based on Information Theory. This generalisation allows us to detect transcription factor binding sites with maximum performance regardless of the covariability observed in the positions of the training set of sequences. SIGMA is freely available to the public at http://b2slab.upc.edu. Postprint (published version)

    Applicability of semi-supervised learning assumptions for gene ontology terms prediction

    Gene Ontology (GO) is one of the most important resources in bioinformatics, aiming to provide a unified framework for the biological annotation of genes and proteins across all species. Predicting GO terms is an essential task for bioinformatics, but the number of available labelled proteins is in several cases insufficient for training reliable machine learning classifiers. Semi-supervised learning methods arise as a powerful solution that exploits the information contained in unlabelled data in order to improve the estimations of traditional supervised approaches. However, semi-supervised learning methods have to make strong assumptions about the nature of the training data, and thus the performance of the predictor is highly dependent on these assumptions. This paper presents an analysis of the applicability of semi-supervised learning assumptions to the specific task of GO term prediction, focused on providing criteria for choosing the most suitable tools for specific GO terms. The results show that semi-supervised approaches significantly outperform the traditional supervised methods and that the highest performances are reached when applying the cluster assumption. In addition, it is experimentally demonstrated that the cluster and manifold assumptions are complementary to each other, and an analysis is provided of which GO terms are more likely to be correctly predicted with each assumption. Postprint (published version)

    Synthetic benchmarks for machine olfaction: Classification, segmentation and sensor damage

    The design of signal and data processing algorithms requires a validation stage and data relevant for the validation procedure. While the practice of sharing public data sets and making use of them is a recent and still ongoing activity in the community, the synthetic benchmarks presented here are an option for researchers who need data for testing and comparing the algorithms under development. The collection of synthetic benchmark data sets was generated for classification, segmentation and sensor damage scenarios, each defined at 5 difficulty levels. The published data are related to the data simulation tool, which was used to create a virtual array of 1020 sensors with a default set of parameters. Postprint (published version)

    Use of Gene Ontology semantic information in protein interaction data visualization

    The Gene Ontology project is an effort to structure knowledge on biological products and processes by adding semantic information to them. This is done in a systematic way so that this additional information can be automatically processed. In this contribution a protein-protein interaction visualization algorithm is proposed, which combines protein interaction data with Gene Ontology semantic information. The information is integrated using a semantic distance measure defined in ontologies or taxonomies. Multidimensional scaling is applied to this measure and the output complements protein interaction data in building an interaction visualization map. Peer Reviewed. Postprint (published version)

    Applying learning analytics to students’ interaction in business simulation games. The usefulness of learning analytics to know what students really learn

    Nowadays, different ways of communication and interaction among multiple actors are dominating learning processes. However, there are critical opinions that question the contribution of student interaction to real learning. This study applies learning analytics and data mining techniques to explore the online discussion forums of 362 business students at the bachelor and master levels, who participated in business simulation games between 2011 and 2016. The findings revealed that the most frequent contents in the students’ online discussion forums were related, firstly, to the parameters and features of the business simulation game, and, secondly, to elements that fostered the students’ learning process, while small talk or regular conversation did not appear to be relevant. In addition, the contents with predictive power over learning results were related to uncertainty, time, interaction, communication and collaboration, although none of these elements influenced teacher assessment of student learning. This study reveals the usefulness of learning analytics tools for gaining a wider and more holistic view of the learning process of students, discovering new aspects that affect students’ learning results. Peer Reviewed. Postprint (author's final draft)

    Automated resolution of chromatographic signals by independent component analysis-orthogonal signal deconvolution in comprehensive gas chromatography/mass spectrometry-based metabolomics

    Comprehensive gas chromatography-mass spectrometry (GC x GC-MS) provides a different perspective in metabolomics profiling of samples. However, algorithms for GC x GC-MS data processing are needed in order to automatically process the data and extract the purest information about the compounds appearing in complex biological samples. This study shows the capability of independent component analysis-orthogonal signal deconvolution (ICA-OSD), an algorithm based on blind source separation and distributed in an R package called osd, to extract the spectra of the compounds appearing in GC x GC-MS chromatograms in an automated manner. We studied the performance of ICA-OSD by the quantification of 38 metabolites through a set of 20 Jurkat cell samples analyzed by GC x GC-MS. The quantification by ICA-OSD was compared with a supervised quantification by selective ions, and most of the R² coefficients of determination were in good agreement (R² > 0.90) while up to 24 cases exhibited an excellent linear relation (R² > 0.95). We concluded that ICA-OSD can be used to resolve co-eluted compounds in GC x GC-MS. © 2016 Elsevier Ireland Ltd. All rights reserved. Postprint (author's final draft)

    Training-induced gene expression plasticity in cardiac function and neural regulation for ultra-trail runners

    This study aims to assess the gene regulatory response of a group of 16 athletes and to observe the plasticity induced by their training regime on the gene expression response after their participation in an 82 km race. Blood samples for differential gene expression (DGE) were collected before and after this effort from two groups of runners with different training regimes: elite and active. Analyses focused only on genes annotated as related to cardiac function (CF) and neural regulation (NR) in the KEGG PATHWAY Database. Thus, 13 pathways were considered, accounting for a total of 629 genes. Training regime modulated the response to exercise: a list of 18 ranked genes showed significant DGE for elite runners, while the response remained statistically insignificant for active athletes. The UQCR11, COX7C and COX4I1 genes, related to the mitochondrial respiratory chain, were down-regulated, which may indicate mitochondrial function impairment in cardiac muscle. Increased expression levels were obtained for the PIK3R2, PLCG2 and IRAK3 genes from the positive signaling cascades of the neurotrophin pathway, which may reveal improved heart rate control thanks to better cardiac sympathetic innervation. Postprint (author's final draft)

    Benchmarking network propagation methods for disease gene identification

    In-silico identification of potential target genes for disease is an essential aspect of drug target discovery. Recent studies suggest that successful targets can be found by leveraging genetic, genomic and protein interaction information. Here, we systematically tested the ability of 12 varied algorithms, based on network propagation, to identify genes that have been targeted by any drug, on gene-disease data from 22 common non-cancerous diseases in OpenTargets. We considered two biological networks and six performance metrics, and compared two types of input gene-disease association scores. The impact of the design factors on performance was quantified through additive explanatory models. Standard cross-validation led to over-optimistic performance estimates due to the presence of protein complexes. In order to obtain realistic estimates, we introduced two novel protein complex-aware cross-validation schemes. When seeding biological networks with known drug targets, machine learning and diffusion-based methods found around 2-4 true targets within the top 20 suggestions. Seeding the networks with genes associated with disease by genetics decreased performance below 1 true hit on average. The use of a larger network, although noisier, improved overall performance. We conclude that diffusion-based prioritisers and machine learning applied to diffusion-based features are suited for drug discovery in practice and improve over simpler neighbour-voting methods. We also demonstrate the large impact of choosing an adequate validation strategy and the definition of seed disease genes. Peer Reviewed. Postprint (published version)

    Evaluation of cross-validation strategies in sequence-based binding prediction using deep learning

    Binding prediction between targets and drug-like compounds through deep neural networks has generated promising results in recent years, outperforming traditional machine learning-based methods. However, the generalization capability of these classification models is still an issue to be addressed. In this work, we explored how different cross-validation strategies applied to data from different molecular databases affect the performance of binding prediction proteochemometrics models. These strategies are (1) random splitting, (2) splitting based on K-means clustering (both of actives and inactives), (3) splitting based on source database, and (4) splitting based on both the clustering and the source database. These schemas are applied to a deep learning proteochemometrics model and to a simple logistic regression model used as a baseline. Additionally, two different ways of describing molecules in the model are tested: (1) by their SMILES and (2) by three fingerprints. The classification performance of our deep learning-based proteochemometrics model is comparable to the state of the art. Our results show that the lack of generalization of these models is due to a bias in public molecular databases and that a restrictive cross-validation schema based on compound clustering leads to worse but more robust and credible results. Our results also show better performance when representing molecules by their fingerprints. Peer Reviewed. Postprint (author's final draft)
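The grouped splitting strategies listed in the abstract above (by cluster, by source database, or by both) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `grouped_folds`, the greedy balancing rule and the toy labels are all assumptions made for illustration. The only property it is meant to show is that samples sharing a group label never end up in different folds.

```python
from collections import defaultdict


def grouped_folds(group_labels, n_folds):
    """Assign sample indices to folds so that no group is split across folds.

    group_labels: one hashable label per sample (e.g. a cluster id, a source
    database name, or a (cluster, source) tuple for the combined strategy).
    Hypothetical helper, not taken from the paper.
    """
    groups = defaultdict(list)
    for idx, label in enumerate(group_labels):
        groups[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: place the largest groups first, each into the
    # currently smallest fold (ties broken by label text for determinism).
    for label in sorted(groups, key=lambda g: (-len(groups[g]), str(g))):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(groups[label])
    return folds


# Toy example: eight compounds with made-up cluster ids.
cluster_ids = [0, 0, 1, 1, 1, 2, 3, 3]
folds = grouped_folds(cluster_ids, n_folds=2)
for k, test_idx in enumerate(folds):
    train_idx = [i for j, f in enumerate(folds) if j != k for i in f]
    print(f"fold {k}: test={sorted(test_idx)} train={sorted(train_idx)}")
```

Passing source-database names instead of cluster ids gives a source-based split, and passing `(cluster, source)` tuples keeps each cluster-source combination within one fold, which is one possible reading of the combined strategy. Random splitting corresponds to giving every sample its own label.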

    Non-targeted metabolomics reveals alterations in liver and plasma of gilt-head bream exposed to oxybenzone

    The extensive use of the organic UV filter oxybenzone has led to its ubiquitous occurrence in the aquatic environment, causing an ecotoxicological risk to biota. Although some studies have reported adverse effects, such as reproductive toxicity, further research is needed to assess its molecular effects and mechanism of action. Therefore, in the present work, we investigated metabolic perturbations in juvenile gilt-head bream (Sparus aurata) exposed over 14 days via the water to oxybenzone (50 mg/L). The non-targeted analysis of brain, liver and plasma extracts was performed by means of UHPLC-qOrbitrap MS in positive and negative modes with both C18 and HILIC separation. Although there was no mortality or alteration in general physiological parameters during the experiment, and the metabolic profile of the brain was not affected, the results of this study showed that oxybenzone could perturb both the liver and plasma metabolomes. The pathway enrichment suggested that different pathways in lipid metabolism (fatty acid elongation, α-linolenic acid metabolism, biosynthesis of unsaturated fatty acids and fatty acid metabolism) were significantly altered, as well as metabolites involved in phenylalanine and tyrosine metabolism. Overall, these changes are signs of possible oxidative stress and energy metabolism modification. Therefore, this research indicates that oxybenzone has adverse effects beyond the commonly studied hormonal activity, and demonstrates the sensitivity of metabolomics in assessing molecular-level effects of emerging contaminants. Peer Reviewed. Postprint (author's final draft)